1 Executive Summary

In  this  project,  to  achieve  our  goals  of  (i)  further  enhancing  capability  to  analyze 
unstructured  social  media  data  at  scale  and  rapidly,  and  (ii)  improving  IAI  social  media 
software  Scraawl’s  features,  we  have  accomplished  the  following. 

First, we designed and implemented temporal community detection and influence discovery
algorithms and associated visualizations using Twitter data. These capabilities improved our
understanding of how features associated with influence and communities evolve over time as
social conversations about a particular event begin to grow.

Second,  we  improved  Scraawl  UI  by  designing  and  implementing  (i)  bookmarking  and 
case  report  capability,  (ii)  user  profile  enhancements  such  as  linking  social  media 
accounts  with  Scraawl,  (iii)  translation  of  words/phrases  in  searches,  and  (iv)  a 
communication  methodology  in  the  UI  so  that  the  user  gets  notified  of  data  changes  in 
the  backend  for  improved  usability  and  navigation. 

Third,  we  improved  the  computational  framework  of  Scraawl  to  make  it  more  robust  and 
handle  higher  data  loads.  In  particular,  we  improved  (i)  the  messaging  architecture,  (ii) 
data  redundancy,  and  (iii)  the  service  availability. 

Fourth, we enhanced the Named Entity Recognition (NER) capabilities of Scraawl by
(i) incorporating part-of-speech tagging using GATE, (ii) enhancing the gazetteers with
multilingual entities, and (iii) adding multilingual name matching capabilities.

Fifth, we developed analytics that evaluate the text of tweets and user profiles to
extract and identify location information and geo-reference the locations on a map. We
identified locations through NER and displayed geo-coordinates (longitude and
latitude) for each recognized location. We then provided the capability to plot
locations mentioned in tweets on a map.

Finally,  we  released  IAI  social  media  software  Scraawl  2.0  that  incorporates  these 
features. 

2 Technical Work


2.1  Design  and  Implementation  of  Temporal  Analytics 

Building on our capabilities for influential-user discovery and community detection on
Twitter, we designed temporal analysis algorithms covering all aspects of the front end, back
end, and services. Inputs to these algorithms are the tweet dataset (i.e., a Scraawl
report) and the number of time fractions (e.g., 4). Based on these user-provided
parameters, Scraawl partitions the Twitter data into time fractions (according to tweet
post times), runs Twitter analytics over each time fraction, and displays how the
analytics results change over time. A representative analysis is shown in Figure 1.
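
The partitioning step can be sketched as follows. This is a minimal illustration, not the Scraawl implementation: it assumes equal-width time fractions between the earliest and latest tweet, and the function name partition_by_fractions is hypothetical.

```python
from datetime import datetime, timedelta


def partition_by_fractions(timestamps, n_fractions):
    """Split tweet post times into n_fractions equal-width time fractions.

    Assumption: fractions have equal duration between the earliest and
    latest tweet; the report does not state how fractions are sized.
    """
    if n_fractions < 1:
        raise ValueError("n_fractions must be at least 1")
    if not timestamps:
        return []
    start, end = min(timestamps), max(timestamps)
    width = (end - start) / n_fractions
    fractions = [[] for _ in range(n_fractions)]
    for ts in timestamps:
        if width == timedelta(0):
            index = 0  # all tweets share one timestamp
        else:
            # The latest tweet falls exactly on the right edge, so clamp it
            # into the last fraction.
            index = min(int((ts - start) / width), n_fractions - 1)
        fractions[index].append(ts)
    return fractions


if __name__ == "__main__":
    base = datetime(2015, 11, 27, 8, 0)
    posts = [base + timedelta(minutes=m) for m in (0, 30, 60, 120, 240)]
    for number, fraction in enumerate(partition_by_fractions(posts, 4), start=1):
        print(number, len(fraction))
```

An analytic such as influence discovery would then be run once per fraction and the per-fraction results compared.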


[Figure 1 panels: “Influence Detection: Temporal Analysis,” showing the top influential users and top influential hashtags for 8:30-9:30 (8,914 tweets) and for 10:30-11:30 (14,006 tweets). An annotation on the later panel notes that newly influential users (e.g., @koreyaba_www) appear who were not present in the first hour.]

Figure  1:  Temporal  Analysis  of  Influential  Users  in  a  Report. 


In particular, we implemented smart data binning for the influential-nodes analytic on
the Twitter data source in order to evaluate how the results of this analytic evolve
over time.


Our  lightweight  temporal  software  implementation  is  described  as  follows: 


1. We extended the influential nodes web service to accept the time period of a bin,
in minutes, as a parameter (IntervalInMinutes). If the caller passes zero for
IntervalInMinutes, the web service runs the analytic over all tweets; otherwise it
runs the temporal analytic.

2. We modified the tweets database extraction routine to pull the creation timestamp
of every tweet as reported by Twitter.

3. We developed a new smart binning routine that takes as input the StartingTime,
EndingTime, and an IntervalInMin:

a. We retrieve the hour and minutes from the timestamp of the first tweet in the
report.

b. We calculate the EndingTime of the first bin by finding the next minute m
that satisfies m = K * IntervalInMinutes, for an integer K, among all
the minutes in the first hour of the social media data report.

c. We compute all the bins by shifting the StartingTime and EndingTime of
the first bin by IntervalInMinutes until the StartingTime is greater than the
last tweet time in the report.

d. We replace the EndingTime of the last bin with the time of the last tweet in
the report.

4. The posts are divided into batches and grouped by report_id and time period
using the smart data binning approach. We developed a routine that returns all
the tweets in each bin, i.e., those whose time t satisfies the following
condition: Starting_Time_Of_Bin <= t < End_Time_Of_Bin

5. We developed a new routine to store the application results along with additional
fields = { StartingTime of each bin, EndingTime of each bin, and the BinNumber }
in a dedicated table of the MySQL database. We extended the analytics influence
results table in the MySQL database and the Influence model to support storage of
the additional fields.

6. We run our influence discovery application on those batches of social media data
within the report and store the results in the MySQL database.

7. On the GUI, we developed a scrollable list of temporal batches so that the user can
see the evolution of the analytical results.
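
Steps 3 and 4 above can be sketched as follows. This is one possible reading of the binning steps rather than the actual Scraawl routine, and it rests on three assumptions: the first bin runs from the first tweet to the next minute-of-hour that is a multiple of the interval (rolling over to the top of the next hour if none remains), later bins are contiguous and one interval wide, and the last bin is treated as closed on the right so that the final tweet is not dropped. The names build_bins and assign_to_bins are hypothetical.

```python
from datetime import datetime, timedelta


def build_bins(first_tweet, last_tweet, interval_in_minutes):
    """Return (bin_number, start, end) tuples covering the report."""
    if interval_in_minutes <= 0:
        raise ValueError("interval_in_minutes must be positive")
    if first_tweet > last_tweet:
        raise ValueError("first_tweet must not be later than last_tweet")
    step = timedelta(minutes=interval_in_minutes)
    hour_start = first_tweet.replace(minute=0, second=0, microsecond=0)
    # Step b: next minute m = K * interval within the first hour
    # (assumption: capped at 60, i.e. the top of the next hour).
    k = first_tweet.minute // interval_in_minutes + 1
    m = min(k * interval_in_minutes, 60)
    start, end = first_tweet, hour_start + timedelta(minutes=m)
    bins = []
    # Step c: shift by the interval until the start passes the last tweet.
    while start <= last_tweet:
        bins.append((len(bins) + 1, start, end))
        start, end = end, end + step
    # Step d: the last bin ends at the time of the last tweet.
    number, last_start, _ = bins[-1]
    bins[-1] = (number, last_start, last_tweet)
    return bins


def assign_to_bins(timestamps, bins):
    """Group tweet times by bin number using start <= t < end."""
    grouped = {number: [] for number, _, _ in bins}
    last_number = bins[-1][0]
    for t in timestamps:
        for number, start, end in bins:
            # Assumption: the last bin also includes t == end.
            if start <= t < end or (number == last_number and t == end):
                grouped[number].append(t)
                break
    return grouped


if __name__ == "__main__":
    first = datetime(2015, 11, 27, 8, 37)
    last = datetime(2015, 11, 27, 9, 20)
    for number, start, end in build_bins(first, last, 15):
        print(number, start.time(), end.time())
```

Each group returned by assign_to_bins would then be passed to the influence analytic and stored with its StartingTime, EndingTime, and BinNumber.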

Example web service request for temporal influential nodes:

{
    "topk": "10",
    "topn": "10",
    "IntervalInMin": "15",
    "promise": "http://www.google.com",
    "error": "http://www.google.com"
}


Example  web  service  response  for  temporal  influential  nodes: 


{
    "status": "OK",
    "message": "Analytics completed"
}


Example  web  service  request  for  influential  nodes  on  all  tweets: 

{
    "topk": "10",
    "topn": "10",
    "IntervalInMin": "0",
    "promise": "http://www.google.com",
    "error": "http://www.google.com"
}
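
As an illustration of how a caller might assemble the two request bodies above, the following sketch builds the JSON payload and distinguishes the temporal case from the all-tweets case. The function names are hypothetical, and the "promise" and "error" values are passed through unchanged because the text does not describe their role.

```python
import json


def build_influence_request(topk, topn, interval_in_min, promise_url, error_url):
    """Build a request body in the shape of the examples above.

    Values are serialized as strings because the examples quote them.
    """
    if interval_in_min < 0:
        raise ValueError("interval_in_min must be zero or positive")
    payload = {
        "topk": str(topk),
        "topn": str(topn),
        "IntervalInMin": str(interval_in_min),
        "promise": promise_url,
        "error": error_url,
    }
    return json.dumps(payload)


def is_temporal(request_body):
    """Zero means 'run on all tweets'; any other value means temporal."""
    return int(json.loads(request_body)["IntervalInMin"]) != 0


if __name__ == "__main__":
    body = build_influence_request(
        10, 10, 15, "http://www.google.com", "http://www.google.com"
    )
    print(body)
    print(is_temporal(body))
```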


2.2  Upgrade  Scraawl  User  Interface  (UI) 

We  have  enhanced  the  responsive  UI  and  added  features  to  Scraawl  UI,  which  are 
detailed  below. 

2.2.1  Bookmarking  and  Case  Report  Capability 

We have added a capability to bookmark a set of tweets and export them into a case
folder, which is, for all practical purposes, another Scraawl report. Case reports are
added as part of “My Folders” in the “My Reports” screen and are shown along with
“Shared Reports,” “Reports Shared with me,” and “Archived Reports” (see Figure 2).


[Figure 2 screenshot: the “My Folders” panel listing each folder name and its number of reports: Shared Reports (1), Reports Shared With Me (0), Archived Reports (28), Case Reports (1).]

Figure  2:  “Case  Reports”  under  “My  Folders.” 


A user can bookmark a tweet or a set of tweets and then export these bookmarked tweets
into a Case Report. The most convenient way to bookmark many tweets at once is to run a
search under “Raw Data” and use “Bookmark ALL matching” under the “Bookmark” dropdown
menu. Once a tweet is bookmarked, a bookmark symbol is shown to the right of the tweet,
before the date column.

The  bookmarked  tweets  can  then  be  exported  into  a  “Case  Report”  by  using  the  “Export 
Bookmarked”  menu  item  under  the  “Actions”  dropdown  menu  (see  Figure  3). 


[Figure 3 screenshot: report header (Status: Stopped; Posts: 38,012; Updated: 27 Nov, 2015 18:01; Timeline: 27 Nov, 2015 02:38 - 27 Nov, 2015 18:01) with the “Actions” dropdown open, listing Clone Search, Edit Report, Undo All Edits, Export Bookmarked (highlighted), Share/Unshare, Archive Report, and Delete Report.]

Figure  3:  “Export  Bookmarked”  menu  item. 

2.2.2 Enhancement of User Profiles

The user profile screen has been enhanced to show data access limits and managed
access to analytics. Specific submenus are as follows.

•  Basic  Information:  Basic  user  information  such  as  name  and  e-mail  is  shown 
here.  Users  can  also  change  their  password  and  change  their  time  zone.  If  a  user 
belongs  to  a  group  then  that  information  is  shown  here  as  well. 

• Linked Accounts: In addition to linking Twitter accounts, users can now link
their Instagram accounts. This is encouraged because users can then use their own token
(as opposed to the shared IAI token) when using public APIs to collect data.

• Report Limits: Limits on the number of search, active search, user, shared, archived
search, archived user, and case reports are shown here. These limits are based on
user privileges.

• Twitter Limits: All limits and privileges regarding Twitter data collection are
shown here.

• Instagram Limits: All limits and privileges regarding Instagram data collection
are shown here.

•  Analytics  Limits:  The  availability  of  analytics  for  Twitter  and  Instagram  is 
shown  here.  These  privileges  are  based  on  user  or  group. 

•  Manage  Account:  Users  can  delete  their  account,  in  which  case  their  data  is 
deleted  from  IAI  servers  as  well. 

2.2.3  Translation  of  Words/Phrases  in  Searches 

We enabled a translation capability directly in the search screen, where users can enter
words and phrases in any of 90 languages, translate into or from those languages, and
start searching. This capability is provided in both basic and advanced searches.

2.2.4  Usability  and  Navigation  Improvements 

We have also made improvements to usability and navigation. We improved the
communication methodology in the UI so that the front-end user interface is notified of
data changes in the backend. In particular, we improved the efficiency of the JavaScript
libraries that communicate with the backend data and analytics services and display the
results to the end user. We made the front-end user interface more efficient, responsive,
and dynamic: the UI renders faster, and data fields are updated more frequently and more
efficiently. We also improved the user interface for mobile devices by better adapting to
the specifications of the device requesting the data, and we improved the visualization of
the analytics results so that exploring them is more intuitive and accessible.

We also improved navigation throughout the application so that mobile users can better
interact with the interface. We reduced the number of steps a user has to perform to
achieve a desired outcome, particularly on mobile devices; one example is the improved
“breadcrumbs” throughout Scraawl. These changes improve the user’s contextual awareness
by making it easier to navigate within and between data and analytical activities.

2.3  Upgrade  Scraawl  Computational  Framework  to  Increase  Robustness 

In  this  task  we  improved  the  computational  framework  of  Scraawl  to  make  it  more  robust 
and  handle  higher  data  loads.  Figure  4  shows  the  different  components  that  are  part  of  the 
Scraawl  computational  framework,  and  the  section  of  the  framework  that  was  upgraded  is 
highlighted  in  the  figure.  As  part  of  this  task  we  focused  on  improving  the  (i)  messaging 
architecture,  (ii)  data  redundancy,  and  (iii)  service  availability. 


*LB/RL = Load Balancer & Rate Limiter

Figure  4:  Scraawl  computational  framework. 


In particular, we improved the architecture that handles the tasking and messaging aspects
of the computational framework. This involved making the message queues highly
available, improving task delegation intelligence, and improving the monitoring of
the message queues. For data redundancy, we added more hardware to support replicated
data storage. We also improved the availability of the application by providing redundant
standby data nodes. To improve service availability, we improved the monitoring
intelligence of the computational nodes and introduced redundant standby
computational nodes that can be added to the production system to handle unexpected
service failures. This reduced downtime and added robustness to the computationally
heavy analytic services.

2.4 Enhance Named Entity Recognition (NER)

We enhanced Scraawl’s NER capabilities, which (i) resolve a large set of names,
organizations, and places in English, and (ii) expand abbreviations, e.g., UN to United
Nations. In this task, we have made the following improvements to the Scraawl NER
module.

Incorporated GATE Part-of-Speech (POS) tagging: We have started using the English POS
tagger of the General Architecture for Text Engineering (GATE) software [1] as part of
the Scraawl NER module. GATE is open source software for many common tasks
related to Natural Language Processing. Its POS tagger [2] is a modified version of the
Brill tagger, which produces a part-of-speech tag as an annotation on each word or
symbol. The current NER development software classifies a word as an entity if and
only if the word appears in one of the gazetteers and its POS tag is a noun.
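
The classification rule in the last sentence can be sketched as follows. This is an illustration only: it does not call GATE, the tagged tokens are supplied by hand, the gazetteer contents are hypothetical, and it assumes Penn Treebank-style tags in which noun tags begin with "NN".

```python
def classify_entities(tagged_tokens, gazetteers):
    """Return (word, entity_type) pairs for gazetteer words tagged as nouns.

    tagged_tokens: list of (word, pos_tag) pairs from a POS tagger.
    gazetteers: dict mapping an entity type to a set of lowercase names.
    Assumption: noun tags start with "NN" (NN, NNS, NNP, NNPS).
    """
    entities = []
    for word, pos_tag in tagged_tokens:
        if not pos_tag.startswith("NN"):
            continue  # gazetteer hits that are not nouns are rejected
        for entity_type, names in gazetteers.items():
            if word.lower() in names:
                entities.append((word, entity_type))
                break
    return entities


if __name__ == "__main__":
    # Hypothetical gazetteers and hand-tagged tokens for illustration.
    gazetteers = {"person": {"may"}, "location": {"paris"}}
    tokens = [("May", "MD"), ("I", "PRP"), ("visit", "VB"), ("Paris", "NNP")]
    print(classify_entities(tokens, gazetteers))
```

In this example the word "May" matches the person gazetteer but is tagged as a modal verb, so only "Paris" is accepted, which is the kind of false positive the POS check is meant to filter out.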

Enhanced the gazetteers and incorporated multilingual name matching
capabilities: We have enhanced the gazetteers by including the open source JRC-Names
dictionary [3] and the NGA Geographical Names Database [4]. With the compiled
dictionaries, we have added the capability of resolving (i) 1.18+ million persons, (ii)
6700+ organizations, and (iii) virtually every town/city in English, in its native
language, and in additional common languages.
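
One simple way to picture multilingual name matching is a lookup table from name variants to a canonical entity. The sketch below is an assumption about the general approach, not a description of the Scraawl module or of the JRC-Names file format; the function names and the tiny variant table are hypothetical.

```python
import unicodedata


def normalize(name):
    """Normalize Unicode form, case, and whitespace before lookup."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(folded.split())


def build_index(variants_by_entity):
    """Map every normalized variant (and the canonical name) to the entity."""
    index = {}
    for canonical, variants in variants_by_entity.items():
        for form in [canonical, *variants]:
            index[normalize(form)] = canonical
    return index


def resolve(name, index):
    """Return the canonical entity for a name, or None if unknown."""
    return index.get(normalize(name))


if __name__ == "__main__":
    # Hypothetical variant table for illustration.
    index = build_index(
        {"United Nations": ["UN", "Nations Unies", "Vereinte Nationen", "Naciones Unidas"]}
    )
    print(resolve("vereinte nationen", index))
    print(resolve("UN", index))
```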

2.5  Incorporate  Geo-reference  Analytics 

We developed analytics that evaluate the text of tweets and user profiles to extract
and identify location information and geo-reference the locations on a map. We identified
locations through NER and displayed geo-coordinates (longitude and latitude) for
each location recognized by NER. We then provided the capability to plot locations
mentioned in tweets on a map.
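
The geo-referencing step, going from location names recognized by NER to coordinates that can be plotted, can be sketched as a gazetteer lookup. The table below is a hypothetical stand-in with approximate coordinates for illustration only, and the function name geo_reference is an assumption.

```python
# Hypothetical gazetteer: lowercase place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
EXAMPLE_GAZETTEER = {
    "paris": (48.8566, 2.3522),
    "ankara": (39.9334, 32.8597),
}


def geo_reference(location_names, gazetteer):
    """Attach coordinates to location names; unknown names are skipped."""
    points = []
    for name in location_names:
        coords = gazetteer.get(name.casefold())
        if coords is None:
            continue
        latitude, longitude = coords
        points.append({"name": name, "latitude": latitude, "longitude": longitude})
    return points


if __name__ == "__main__":
    # Location names as they might come out of the NER step.
    for point in geo_reference(["Paris", "Atlantis", "Ankara"], EXAMPLE_GAZETTEER):
        print(point)
```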

We also added MapBox, a mapping and search service built on vector maps that are
rendered in real time, and made it available as part of Scraawl’s geospatial
analytics. A screenshot of the MapBox screen is shown in Figure 5, where geo-coded,
geo-referenced, and geo-profiled tweets are shown. The screen supports click/drag/filter
(by selecting the square icon and drawing rectangles with the mouse), zoom in/out (with
either the +/- buttons or the mouse wheel), home zooming, and editing/deleting selected
regions. In addition, using the “magnifying glass” icon, a user can search for
locations/places, and the map automatically zooms into the selected region.
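
The rectangle filter described above amounts to a bounding-box test on each tweet's coordinates. The following sketch shows that test in isolation; it does not use the MapBox API, the function name is hypothetical, and it assumes the rectangle does not cross the antimeridian.

```python
def filter_by_rectangle(points, south, west, north, east):
    """Keep points whose coordinates fall inside the drawn rectangle.

    points: iterable of dicts with "latitude" and "longitude" keys.
    Assumption: west <= east, i.e. the rectangle does not wrap around
    the 180-degree meridian.
    """
    if south > north or west > east:
        raise ValueError("expected south <= north and west <= east")
    return [
        p
        for p in points
        if south <= p["latitude"] <= north and west <= p["longitude"] <= east
    ]


if __name__ == "__main__":
    tweets = [
        {"id": 1, "latitude": 48.86, "longitude": 2.35},
        {"id": 2, "latitude": 39.93, "longitude": 32.86},
    ]
    # A rectangle roughly covering western Europe (illustrative bounds).
    print(filter_by_rectangle(tweets, 40.0, -10.0, 55.0, 10.0))
```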

Figure 5: MapBox screen.

3 Conclusion

With the algorithms implemented and the improvements made to both the software and the UI,
users will have an enhanced information tailoring capability using Scraawl. This will enable
users to better answer questions about key actors, topics, events, communities,
sentiments, discourses, etc., using social media. In addition, users will have more
capabilities to create their own workflows by incorporating searching, filtering, and
analytics in any order and as many times as needed. For example, with temporal
analytics, users can first find influential users in two different time periods and then
filter to the desired time period. The user can then run geo-referencing only on the
filtered time period, effectively increasing the signal-to-noise ratio.